Decoding Methods for Text Generation with LLMs

A hands-on comparison of greedy search, beam search, sampling, top-k, top-p, and contrastive search using small pretrained models

Published

October 20, 2024

Keywords: decoding methods, text generation, greedy search, beam search, top-k sampling, top-p sampling, nucleus sampling, contrastive search, temperature, GPT-2, transformers, LLM inference

Introduction

When a pretrained language model generates text, it produces a probability distribution over the entire vocabulary at each step. The decoding method determines how the next token is selected from that distribution. This choice has a dramatic effect on output quality — the same model can produce boring repetitive text or creative human-like prose, depending entirely on the decoding strategy.

This article provides a practical, hands-on comparison of six common decoding methods using GPT-2 (124M parameters) — a small model that runs comfortably on CPU. All code examples use the Hugging Face transformers library and can be reproduced on any machine.

If you are new to running LLMs locally, check out Run LLM Locally with Ollama for a beginner-friendly setup guide. For deploying models at scale, see Deploying and Serving LLM with vLLM and Deploying and Serving LLM with Llama.cpp.

How Auto-Regressive Generation Works

All decoder-only LLMs (GPT-2, Llama, Mistral, Phi, etc.) generate text one token at a time. The probability of a word sequence is decomposed as:

P(w_{1:T} | W_0) = \prod_{t=1}^{T} P(w_t | w_{1:t-1}, W_0)

where W_0 is the initial prompt (context) and T is the generated sequence length. Generation stops when the model emits an end-of-sequence (EOS) token or a maximum length is reached.

The key question is: how do we pick w_t from P(w_t | w_{1:t-1}) at each step? That is exactly what a decoding method defines.
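
To make this concrete, here is a minimal sketch (using the same gpt2 checkpoint as the rest of the article) that computes P(w_t | w_{1:t-1}) for a short prompt and prints the five most likely next tokens:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("Artificial intelligence is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, seq_len, vocab_size)

# The next-token distribution is the softmax of the last position's logits
probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = torch.topk(probs, 5)
for p, i in zip(top_probs, top_ids):
    print(f"{tokenizer.decode(int(i))!r:>12}  {p.item():.3f}")
```

Every decoding method in this article is a different rule for turning this `probs` vector into a single chosen token.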

Setup

Install the required library and load the model:

pip install transformers torch

from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained(
    "gpt2", pad_token_id=tokenizer.eos_token_id
).to(device)

prompt = "Artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

We use GPT-2 (124M parameters) throughout — small enough to run on CPU, yet large enough to demonstrate clear differences between decoding strategies.

3. Pure Sampling

Sampling randomly picks the next token according to its probability distribution:

w_t \sim P(w | w_{1:t-1})

This introduces randomness, breaking the repetition patterns of deterministic methods.

graph TD
    A["Vocabulary distribution"] --> B["nice — 50%"]
    A --> C["dog — 30%"]
    A --> D["car — 10%"]
    A --> E["the — 5%"]
    A --> F["banana — 3%"]
    A --> G["... — 2%"]
    D -.->|"🎲 randomly selected"| H["Next token: car"]

    style D fill:#ffce67,stroke:#333
    style H fill:#ffce67,stroke:#333
    style B fill:#f8f9fa,stroke:#ccc
    style C fill:#f8f9fa,stroke:#ccc
    style E fill:#f8f9fa,stroke:#ccc
    style F fill:#f8f9fa,stroke:#ccc
    style G fill:#f8f9fa,stroke:#ccc

Every token in the vocabulary has a chance proportional to its probability. Even low-probability tokens like “car” (10%) can be selected, which adds diversity but also risk of incoherence.
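
The sampling step itself is just a weighted random draw. The toy sketch below (probabilities taken from the diagram above, renormalized since the "..." tail is omitted) draws 10,000 samples and shows that observed frequencies track the distribution:

```python
import torch

torch.manual_seed(42)

# Toy next-token distribution from the diagram above (tail omitted)
tokens = ["nice", "dog", "car", "the", "banana"]
probs = torch.tensor([0.50, 0.30, 0.10, 0.05, 0.03])
probs = probs / probs.sum()  # renormalize after dropping the tail

# Draw 10,000 samples; frequencies approximate the probabilities
draws = torch.multinomial(probs, num_samples=10_000, replacement=True)
counts = torch.bincount(draws, minlength=len(tokens))
for tok, c in zip(tokens, counts):
    print(f"{tok:>8}  {c.item() / 10_000:.3f}")
```

Note that even "banana" (~3%) gets picked hundreds of times — exactly the diversity-vs-coherence trade-off described above.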

set_seed(42)
output = model.generate(
    **inputs,
    max_new_tokens=60,
    do_sample=True,
    top_k=0  # disable top-k to use full vocabulary
)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Pros:

  • Eliminates repetition.
  • Produces diverse, creative outputs.

Cons:

  • Can produce incoherent or nonsensical text because low-probability (weird) tokens still have a chance of being selected.

Pure sampling is rarely used in practice — the variants below (temperature, top-k, top-p) are used to make it more controlled.

4. Temperature Scaling

Temperature \tau reshapes the probability distribution before sampling. It is applied to the logits before the softmax:

P(w_i) = \frac{\exp(z_i / \tau)}{\sum_j \exp(z_j / \tau)}

graph TD
    subgraph low["Low temp (τ=0.3) — Sharp"]
        direction TB
        a1["nice — 80%"]
        a2["dog — 15%"]
        a3["car — 5%"]
    end
    subgraph mid["Normal (τ=1.0) — Original"]
        direction TB
        b1["nice — 50%"]
        b2["dog — 30%"]
        b3["car — 20%"]
    end
    subgraph high["High temp (τ=2.0) — Flat"]
        direction TB
        c1["nice — 35%"]
        c2["dog — 33%"]
        c3["car — 32%"]
    end

    style low fill:#56cc9d,stroke:#333,color:#fff
    style mid fill:#6cc3d5,stroke:#333,color:#fff
    style high fill:#ff7851,stroke:#333,color:#fff

Low temperature concentrates probability on the top token (more deterministic). High temperature flattens the distribution toward uniform (more random). At τ→0, it becomes greedy search.
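
This effect is easy to verify numerically. The sketch below applies the formula above to three toy logits (illustrative values, not model output) at the three temperatures from the diagram:

```python
import torch

logits = torch.tensor([2.0, 1.5, 0.5])  # toy logits for "nice", "dog", "car"

# Divide logits by tau before the softmax, as in the formula above
dists = {tau: torch.softmax(logits / tau, dim=-1) for tau in (0.3, 1.0, 2.0)}
for tau, probs in dists.items():
    print(f"tau={tau}: {[round(p.item(), 3) for p in probs]}")
```

The top token's share grows as τ shrinks and the distribution approaches uniform as τ grows, matching the sharp/original/flat panels above.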

How temperature changes behavior:

  • \tau < 1 — sharpens the distribution: high-probability tokens become even more likely. Output is more focused and deterministic.
  • \tau = 1 — no change: the original distribution is used.
  • \tau > 1 — flattens the distribution: low-probability tokens get a bigger share. Output is more random and creative.
  • \tau \to 0 — equivalent to greedy search.

set_seed(42)
# Low temperature → more focused
output = model.generate(
    **inputs,
    max_new_tokens=60,
    do_sample=True,
    temperature=0.3,
    top_k=0
)
print("Low temp:", tokenizer.decode(output[0], skip_special_tokens=True))

set_seed(42)
# High temperature → more creative
output = model.generate(
    **inputs,
    max_new_tokens=60,
    do_sample=True,
    temperature=1.5,
    top_k=0
)
print("High temp:", tokenizer.decode(output[0], skip_special_tokens=True))

Temperature is typically combined with top-k or top-p rather than used alone. Common values range from about 0.3 (focused, factual) to 1.0 and slightly above (creative).

5. Top-K Sampling

Top-K sampling (Fan et al., 2018) filters the vocabulary to only the K most likely tokens, then redistributes the probability mass among them.

graph LR
    A["Full vocabulary<br/>(50,257 tokens)"] --> B["Sort by<br/>probability"]
    B --> C["Keep top K=5<br/>tokens only"]
    C --> D["Renormalize<br/>probabilities"]
    D --> E["🎲 Sample from<br/>filtered set"]

    style A fill:#f8f9fa,stroke:#333
    style B fill:#6cc3d5,stroke:#333,color:#fff
    style C fill:#ffce67,stroke:#333
    style D fill:#56cc9d,stroke:#333,color:#fff
    style E fill:#78c2ad,stroke:#333,color:#fff

graph TD
    subgraph kept["Kept — Top K=5"]
        T1["nice — 0.30"]
        T2["dog — 0.25"]
        T3["big — 0.20"]
        T4["old — 0.15"]
        T5["red — 0.10"]
    end
    subgraph removed["Removed"]
        T6["the — 0.04"]
        T7["a — 0.02"]
        T8["banana — 0.001"]
        T9["... 50K+ tokens"]
    end

    style kept fill:#56cc9d,stroke:#333,color:#fff
    style removed fill:#f8f9fa,stroke:#ccc

Top-K keeps a fixed number of candidates regardless of how the probability is distributed. The removed tail tokens can never be selected.

set_seed(42)
output = model.generate(
    **inputs,
    max_new_tokens=60,
    do_sample=True,
    top_k=50
)
print(tokenizer.decode(output[0], skip_special_tokens=True))

How it works:

  1. Compute the probability distribution over the full vocabulary.
  2. Keep only the top K tokens.
  3. Renormalize probabilities among these K tokens.
  4. Sample from the filtered distribution.
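
The four steps above can be sketched directly on a toy distribution (values follow the kept/removed diagram; the 50K-token tail is collapsed into three entries):

```python
import torch

torch.manual_seed(0)

# Toy sorted distribution; last three entries stand in for the long tail
probs = torch.tensor([0.30, 0.25, 0.20, 0.15, 0.10, 0.04, 0.02, 0.001])
k = 5

# Steps 1-2: keep only the K most likely tokens (topk returns them sorted)
top_probs, top_ids = torch.topk(probs, k)
# Step 3: renormalize among the K survivors
filtered = top_probs / top_probs.sum()
# Step 4: sample from the filtered distribution
next_id = top_ids[torch.multinomial(filtered, num_samples=1)]
print("kept ids:", top_ids.tolist(), "sampled:", next_id.item())
```

Tail tokens (indices 5-7 here) have exactly zero probability after filtering — they can never be sampled, no matter the seed.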

Pros:

  • Eliminates nonsensical low-probability tokens.
  • Proven in practice: the original GPT-2 release used top_k=40 for its widely shared, notably coherent story samples.

Cons:

  • Fixed K does not adapt to the shape of the distribution. When the model is very confident (sharp distribution), K=50 may include garbage tokens. When the model is uncertain (flat distribution), K=50 may exclude reasonable candidates.

6. Top-p (Nucleus) Sampling

Top-p sampling (Holtzman et al., 2019) dynamically selects the smallest set of tokens whose cumulative probability exceeds a threshold p.

graph TD
    subgraph confident["Confident distribution — only 3 tokens needed"]
        direction LR
        C1["nice — 0.60"] --> C2["dog — 0.25"] --> C3["big — 0.10"]
    end
    subgraph uncertain["Uncertain distribution — 7 tokens needed"]
        direction LR
        U1["nice — 0.18"] --> U2["dog — 0.16"] --> U3["big — 0.14"] --> U4["old — 0.13"] --> U5["red — 0.12"] --> U6["the — 0.11"] --> U7["a — 0.10"]
    end

    P["p = 0.92"] --> confident
    P --> uncertain

    style confident fill:#56cc9d,stroke:#333,color:#fff
    style uncertain fill:#6cc3d5,stroke:#333,color:#fff
    style P fill:#ffce67,stroke:#333

Top-p adapts the candidate set size dynamically. When the model is confident (sharp distribution), few tokens suffice. When uncertain (flat distribution), more tokens are included. This is the key advantage over fixed Top-K.

set_seed(42)
output = model.generate(
    **inputs,
    max_new_tokens=60,
    do_sample=True,
    top_p=0.92,
    top_k=0  # disable top-k to let top-p work alone
)
print(tokenizer.decode(output[0], skip_special_tokens=True))

How it works:

  1. Sort tokens by probability (descending).
  2. Cumulate probabilities until the sum exceeds p.
  3. Discard all tokens beyond that cutoff.
  4. Renormalize and sample.
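
The cutoff logic can be sketched on the "confident" toy distribution from the diagram (already sorted descending, p = 0.92):

```python
import torch

torch.manual_seed(0)

probs = torch.tensor([0.60, 0.25, 0.10, 0.03, 0.02])  # sorted descending
p = 0.92

# Steps 1-2: cumulative sums are [0.60, 0.85, 0.95, 0.98, 1.00];
# searchsorted finds the first index whose cumulative sum reaches p
cum = torch.cumsum(probs, dim=0)
cutoff = int(torch.searchsorted(cum, torch.tensor(p))) + 1
# Step 3: discard everything beyond the cutoff
nucleus = probs[:cutoff]
# Step 4: renormalize and sample
nucleus = nucleus / nucleus.sum()
next_id = torch.multinomial(nucleus, num_samples=1)
print("nucleus size:", cutoff, "renormalized:", nucleus.tolist())
```

Here three tokens (cumulative 0.95 > 0.92) form the nucleus; on the flat "uncertain" distribution from the diagram the same code would keep seven — the adaptivity that fixed-K lacks.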

Pros:

  • Adapts dynamically — uses fewer tokens when the model is confident, more tokens when it is uncertain.
  • Generally produces more fluent and coherent text than top-k for open-ended generation.

Cons:

  • Still non-deterministic — results vary across runs.

Combining top-k and top-p is a common practice: top-k first removes the long tail, then top-p refines the selection dynamically.

set_seed(42)
output = model.generate(
    **inputs,
    max_new_tokens=60,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=0.8
)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Comparison Summary

Method               Deterministic   Repetition   Coherence     Diversity     Speed
Greedy Search        Yes             High         Medium        Low           Fast
Beam Search          Yes             Medium*      Medium-High   Low           Medium
Pure Sampling        No              Low          Low           High          Fast
Top-K Sampling       No              Low          Medium-High   Medium-High   Fast
Top-p Sampling       No              Low          High          Medium-High   Fast
Contrastive Search   Yes             Low          Very High     Medium        Slow

*with n-gram penalty enabled

When to Use What

  • Factual / deterministic tasks (translation, summarization, code generation): Use beam search with n-gram penalty, or contrastive search.
  • Creative / open-ended generation (storytelling, dialogue, brainstorming): Use top-p + top-k + temperature sampling.
  • Maximum quality on open-ended text: Try contrastive search — it often produces the most human-like output from off-the-shelf models.
  • Quick prototyping / debugging: Greedy search is fast and reproducible; useful for sanity-checking that the model works.
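
These rules of thumb can be collected into a small lookup of generate() keyword presets. The preset names and exact values below are illustrative choices based on the recommendations above, not fixed prescriptions:

```python
# Illustrative generate() presets for the use cases above
PRESETS = {
    "factual": dict(num_beams=5, no_repeat_ngram_size=2, early_stopping=True),
    "creative": dict(do_sample=True, top_k=50, top_p=0.95, temperature=0.8),
    "max_quality": dict(penalty_alpha=0.6, top_k=4),  # contrastive search
    "debug": dict(),  # greedy search is generate()'s default
}

# Usage (assuming `model` and `inputs` from the Setup section):
# output = model.generate(**inputs, max_new_tokens=80, **PRESETS["creative"])
print(sorted(PRESETS))
```

Keeping decoding parameters in one place like this makes it easy to A/B the strategies on your own prompts.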

Full Working Example

Below is a complete script that generates text using all six methods for side-by-side comparison:

from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained(
    "gpt2", pad_token_id=tokenizer.eos_token_id
).to(device)

prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
max_tokens = 80

print("=" * 70)
print("PROMPT:", prompt)
print("=" * 70)

# 1. Greedy Search
output = model.generate(**inputs, max_new_tokens=max_tokens)
print("\n[Greedy Search]")
print(tokenizer.decode(output[0], skip_special_tokens=True))

# 2. Beam Search
output = model.generate(
    **inputs, max_new_tokens=max_tokens,
    num_beams=5, no_repeat_ngram_size=2, early_stopping=True
)
print("\n[Beam Search (5 beams, no repeat 2-gram)]")
print(tokenizer.decode(output[0], skip_special_tokens=True))

# 3. Pure Sampling
set_seed(42)
output = model.generate(
    **inputs, max_new_tokens=max_tokens,
    do_sample=True, top_k=0
)
print("\n[Pure Sampling]")
print(tokenizer.decode(output[0], skip_special_tokens=True))

# 4. Top-K Sampling
set_seed(42)
output = model.generate(
    **inputs, max_new_tokens=max_tokens,
    do_sample=True, top_k=50
)
print("\n[Top-K Sampling (k=50)]")
print(tokenizer.decode(output[0], skip_special_tokens=True))

# 5. Top-p Sampling
set_seed(42)
output = model.generate(
    **inputs, max_new_tokens=max_tokens,
    do_sample=True, top_p=0.92, top_k=0
)
print("\n[Top-p Sampling (p=0.92)]")
print(tokenizer.decode(output[0], skip_special_tokens=True))

# 6. Contrastive Search
output = model.generate(
    **inputs, max_new_tokens=max_tokens,
    penalty_alpha=0.6, top_k=4
)
print("\n[Contrastive Search (alpha=0.6, k=4)]")
print(tokenizer.decode(output[0], skip_special_tokens=True))

Conclusion

The decoding method is just as important as the model itself for text generation quality. Greedy and beam search are simple and deterministic but prone to repetition. Sampling methods (top-k, top-p) introduce randomness for creative diversity but can sacrifice coherence. Contrastive search offers a compelling middle ground — deterministic, fluent, and repetition-free.

For small models like GPT-2, the choice of decoding method has an outsized impact because the model has less capacity to self-correct. Experimenting with different strategies and their hyperparameters is essential to get the best output for your specific use case.

References

  • Fan, A., Lewis, M., Dauphin, Y. (2018). Hierarchical Neural Story Generation. ACL 2018.
  • Holtzman, A., Buys, J., Du, L., Forbes, M., Choi, Y. (2019). The Curious Case of Neural Text Degeneration. ICLR 2020.
  • Su, Y., Lan, T., Wang, Y., Yogatama, D., Kong, L., Collier, N. (2022). A Contrastive Framework for Neural Text Generation. NeurIPS 2022.

Read More

  • Experiment with different small models: DistilGPT-2 (82M), Phi-2 (2.7B), or Qwen2.5-0.5B.
  • Combine decoding methods with fine-tuned models for domain-specific generation.
  • Serve your model locally with Ollama or llama.cpp and control decoding via API parameters.
  • Explore advanced techniques: speculative decoding, min-p sampling, and classifier-free guidance.